Abstract — This paper focuses on two key steps in video analysis:
the detection of moving objects of interest and the tracking of
such objects from frame to frame. The object shape
representations commonly employed for tracking are first
reviewed, and the criteria for feature selection in tracking are
discussed. Various object detection and tracking approaches
are compared and analyzed.

Index Terms — feature selection, image segmentation, object
representation, point tracking

I. Introduction 

Videos are sequences of images, each of which is
called a frame, displayed at a rate fast enough that the
human eye perceives their content as continuous. All
image processing techniques can therefore be applied
to individual frames. In addition, the contents of two consecutive
frames are usually closely related.

Visual content can be modeled as a hierarchy of 
abstractions. At the first level are the raw pixels with color or 
brightness information. Further processing yields features 
such as edges, corners, lines, curves, and color regions. A 
higher abstraction layer may combine and interpret these 
features as objects and their attributes. At the highest level
are the human-level concepts involving one or more objects
and the relationships among them.

Object detection in videos involves verifying the
presence of an object in image sequences and possibly
locating it precisely for recognition. Object tracking
monitors an object's spatial and temporal changes during a video
sequence, including its presence, position, size, and shape.

Tracking is done by solving the temporal correspondence
problem: matching the target region in
successive frames of a sequence of images taken at closely
spaced time intervals. The two processes are closely related
because tracking usually starts with detecting objects, while
detecting an object repeatedly in subsequent frames
is often necessary to support and verify tracking.

II. Object Detection and Tracking Approaches
A. Object Representation 



In a tracking scenario, an object can be defined as anything 
that is of interest for further analysis. For instance, boats on 
the sea, fish inside an aquarium, vehicles on a road, planes in 
the air, people walking on a road, or bubbles in the water are 
a set of objects that may be important to track in a specific 
domain. Objects can be represented by their shapes and 
appearances. In this section, we will first describe the object 
shape representations commonly employed for tracking and 
then address the joint shape and appearance representations. 

— Points. The object is represented by a point, that is, the
centroid (Figure 1(a)) [2], or by a set of points (Figure 1(b)) [3].

In general, the point representation is suitable for tracking 
objects that occupy small regions in an image. 

— Primitive geometric shapes. Object shape is represented
by a rectangle or an ellipse (Figure 1(c), (d)) [4]. Object motion for
such representations is usually modeled by a translation, affine,
or projective (homography) transformation. Though primitive
geometric shapes are more suitable for representing simple
rigid objects, they are also used for tracking nonrigid objects.

— Object silhouette and contour. A contour representation
defines the boundary of an object (Figure 1(g), (h)). The region
inside the contour is called the silhouette of the object (see
Figure 1(i)). Silhouette and contour representations are
suitable for tracking complex nonrigid shapes [5].

— Articulated shape models. Articulated objects are 
composed of body parts that are held together with joints. 
For example, the human body is an articulated object with 
torso, legs, hands, head, and feet connected by joints. The 
relationship between the parts is governed by kinematic 
motion models, for example, joint angle, etc. In order to 
represent an articulated object, one can model the constituent 
parts using cylinders or ellipses as shown in Figure 1(e). 

— Skeletal models. An object skeleton can be extracted by
applying the medial axis transform to the object silhouette [6].
This model is commonly used as a shape representation for
recognizing objects [7]. The skeleton representation can be used
to model both articulated and rigid objects (see Figure 1(f)).
Fig 1. Object representations: (a) centroid, (b) multiple points, (c)
rectangular patch, (d) elliptical patch, (e) part-based multiple
patches, (f) object skeleton, (g) complete object contour, (h)
control points on object contour, (i) object silhouette.
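
As a rough illustration of three of the shape representations listed above (point, rectangle, and contour), the following sketch derives them from a binary object mask. The mask layout, the function names, and the use of a 4-neighborhood to define the contour are assumptions made for this example and are not taken from the cited works.

```python
def object_pixels(mask):
    """List the (row, col) coordinates of all nonzero pixels in a binary mask."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def centroid(mask):
    """Point representation: mean row and column of the object pixels."""
    pixels = object_pixels(mask)
    if not pixels:
        raise ValueError("mask contains no object pixels")
    return (sum(r for r, _ in pixels) / len(pixels),
            sum(c for _, c in pixels) / len(pixels))


def bounding_box(mask):
    """Rectangle representation: inclusive (top, left, bottom, right)."""
    pixels = object_pixels(mask)
    if not pixels:
        raise ValueError("mask contains no object pixels")
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))


def contour(mask):
    """Contour representation: object pixels with a 4-neighbor outside the object."""
    height, width = len(mask), len(mask[0])
    boundary = []
    for r, c in object_pixels(mask):
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            outside_image = not (0 <= nr < height and 0 <= nc < width)
            if outside_image or not mask[nr][nc]:
                boundary.append((r, c))
                break
    return boundary


# Hypothetical 5 x 5 mask containing a 3 x 3 square object (the silhouette).
MASK = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

print(centroid(MASK))       # point representation
print(bounding_box(MASK))   # rectangular patch
print(len(contour(MASK)))   # number of contour pixels
```

The full list returned by `object_pixels` corresponds to the silhouette; the other three functions each discard part of that information, which is why the simpler representations suit small or rigid objects.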

B. Feature Selection For Tracking 

Selecting the right features plays a critical role in tracking. 
In general, the most desirable property of a visual feature is 
its uniqueness so that the objects can be easily distinguished 
in the feature space. Feature selection is closely related to 
the object representation. For example, color is used as a
feature for histogram-based appearance representations,
while for contour-based representations, object edges are
usually used as features. In general, many tracking algorithms
use a combination of these features. The details of common
visual features are as follows.

— Color. The apparent color of an object is influenced
primarily by two physical factors: 1) the spectral power
distribution of the illuminant and 2) the
surface reflectance properties of the object. In image
processing, the RGB (red, green, blue) color space is usually
used to represent color. However, the RGB space is not a
perceptually uniform color space, that is, the differences
between the colors in the RGB space do not correspond to
the color differences perceived by humans [8]. Additionally,
the RGB dimensions are highly correlated. In contrast, L*u*v*
and L*a*b* are perceptually uniform color spaces, while HSV
(Hue, Saturation, Value) is an approximately uniform color
space. However, these color spaces are sensitive to noise [9].
In summary, there is no consensus on which color space is
most effective; therefore, a variety of color spaces have been
used in tracking.

— Edges. Object boundaries usually generate strong changes
in image intensities. Edge
detection is used to identify these changes. An important
property of edges is that they are less sensitive to illumination
changes compared to color features. Algorithms that track
the boundary of the objects usually use edges as the
representative feature. Because of its simplicity and accuracy,
the most popular edge detection approach is the Canny edge
detector [10]. An evaluation of the edge detection algorithms
is provided by [11].



— Optical Flow. Optical flow is a dense field of displacement 
vectors which defines the translation of each pixel in a region. 
It is computed using the brightness constraint, which assumes 
brightness constancy of corresponding pixels in consecutive 
frames [12]. Optical flow is commonly used as a feature in 
motion-based segmentation and tracking applications. 

— Texture. Texture is a measure of the intensity variation of a
surface which quantifies properties such as smoothness and
regularity. Compared to color, texture requires a processing
step to generate the descriptors. There are various texture
descriptors: Gray-Level Co-occurrence Matrices (GLCMs) [13] (a 2D
histogram which shows the co-occurrences of intensities in a
specified direction and distance), Laws' texture measures [14]
(twenty-five 2D filters generated from five 1D filters
corresponding to level, edge, spot, wave, and ripple),
wavelets [15] (orthogonal bank of filters), and steerable
pyramids [16]. Similar to edge features, texture features
are less sensitive to illumination changes than
color.

Features are mostly chosen manually by the user
depending on the application domain. However, the problem
of automatic feature selection has received significant
attention in the pattern recognition community. Automatic
feature selection methods can be divided into filter methods
and wrapper methods [17]. The filter methods try to select
the features based on a general criterion, for example, the
features should be uncorrelated. The wrapper methods select
the features based on the usefulness of the features in a
specific problem domain, for example, the classification
performance using a subset of features.
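
To make the gray-level co-occurrence matrix mentioned among the texture descriptors more concrete, the sketch below counts co-occurring intensity pairs for one direction and distance and derives a contrast statistic from the counts. The tiny four-level image, the non-symmetric counting convention, and the choice of contrast as the summary statistic are assumptions for illustration only.

```python
def glcm(image, dr, dc, levels):
    """Count how often gray level i has gray level j at offset (dr, dc)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """Average squared gray-level difference over all counted pixel pairs."""
    total = sum(sum(row) for row in matrix)
    weighted = sum((i - j) ** 2 * count
                   for i, row in enumerate(matrix)
                   for j, count in enumerate(row))
    return weighted / total


# Hypothetical 4 x 4 image quantized to four gray levels (0..3).
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

horizontal = glcm(IMAGE, 0, 1, levels=4)   # right-hand neighbor, distance 1
for row in horizontal:
    print(row)
print(contrast(horizontal))
```

A smooth surface concentrates the counts near the diagonal and yields a low contrast, while a strongly varying surface spreads them away from it.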

Among all features, color is one of the most widely used
features for tracking. Despite its popularity, most color bands
are sensitive to illumination variation. Hence, in scenarios
where this effect is inevitable, other features are incorporated
to model object appearance. Alternatively, a combination of
these features is also utilized to improve the tracking
performance.
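
The sensitivity of color bands to illumination can be illustrated with a small sketch that uses Python's standard `colorsys` module. It assumes a highly idealized illumination change, namely a uniform scaling of all three RGB channels; the surface color and the scaling factor are hypothetical values. Under that assumption every RGB band and the HSV value band change, whereas hue and saturation do not. Real illumination changes are rarely this simple.

```python
import colorsys


def scale_intensity(rgb, factor):
    """Model an illumination change as a uniform scaling of all three channels."""
    return tuple(min(1.0, channel * factor) for channel in rgb)


surface = (0.8, 0.4, 0.2)                 # RGB in [0, 1], hypothetical object color
dimmed = scale_intensity(surface, 0.5)    # same surface under half the light

h1, s1, v1 = colorsys.rgb_to_hsv(*surface)
h2, s2, v2 = colorsys.rgb_to_hsv(*dimmed)

print("RGB:       ", surface, "->", dimmed)
print("hue:       ", round(h1, 4), "->", round(h2, 4))
print("saturation:", round(s1, 4), "->", round(s2, 4))
print("value:     ", round(v1, 4), "->", round(v2, 4))
```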

Table I. Object Detection Categories

Categories              Representative Work
Point Detectors         Moravec's detector [24],
                        Harris detector [19],
                        Scale Invariant Feature Transform [21],
                        Affine Invariant Point Detector [25].
Segmentation            Mean-shift [37],
                        Graph-cut [23],
                        Active contours [25].
Background Modeling     Mixture of Gaussians [26],
                        Eigenbackground [28],
                        Wallflower [29],
                        Dynamic texture background [30].
Supervised Classifiers  Support Vector Machines [31],
                        Neural Networks [32],
                        Adaptive Boosting [33].

III. OBJECT DETECTION

Every tracking method requires an object detection 
mechanism either in every frame or when the object first 
appears in the video. A common approach for object detection 
is to use information in a single frame. However, some object 
detection methods make use of the temporal information 
computed from a sequence of frames to reduce the number of 
false detections. This temporal information is usually in the 
form of frame differencing, which highlights changing regions 
in consecutive frames. Given the object regions in the image, 
it is then the tracker's task to perform object correspondence 
from one frame to the next to generate the tracks. 
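
Frame differencing, mentioned above as the usual form of temporal information, can be sketched in a few lines. The example assumes grayscale frames stored as nested lists and a hand-picked threshold; both are illustrative choices and not part of any cited method.

```python
def frame_difference(previous, current, threshold):
    """Mark pixels whose intensity changed by more than threshold between frames."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(previous, current)
    ]


# Hypothetical frames: a bright two-pixel-wide object moves one column right.
PREVIOUS = [
    [10, 200, 200, 10, 10],
    [10, 200, 200, 10, 10],
]
CURRENT = [
    [10, 10, 200, 200, 10],
    [10, 10, 200, 200, 10],
]

for row in frame_difference(PREVIOUS, CURRENT, threshold=30):
    print(row)
```

Only the trailing and leading edges of the object are marked, because the interior pixel keeps its intensity. This is one reason why frame differencing is typically combined with other cues rather than used alone.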

A. Point Detectors 

Point detectors are used to find interest points in images 
which have an expressive texture in their respective localities. 
Interest points have long been used in the context of motion,
stereo, and tracking problems. A desirable quality of an interest
point is its invariance to changes in illumination and camera
viewpoint. In the literature, commonly used interest point
detectors include Moravec's interest operator [18], the Harris
interest point detector [19], the KLT detector [20], and the SIFT
detector [21], as illustrated in Figure 2.




Fig 2. Interest points detected by applying (a) the Harris, (b) the 
KLT, and (c) SIFT operators 
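
The idea behind an interest operator of the Moravec type can be sketched as follows: a window is compared with copies of itself shifted by one pixel, and the smallest sum of squared differences is taken as the response. The 3 x 3 window, the eight shift directions, and the synthetic test image are assumptions of this sketch; published variants differ in these details.

```python
SHIFTS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def moravec_response(image, r, c, window=1):
    """Minimum SSD between the window at (r, c) and its eight shifted copies."""
    best = None
    for dr, dc in SHIFTS:
        ssd = 0
        for i in range(-window, window + 1):
            for j in range(-window, window + 1):
                diff = image[r + i + dr][c + j + dc] - image[r + i][c + j]
                ssd += diff * diff
        best = ssd if best is None else min(best, ssd)
    return best


def moravec_map(image, window=1):
    """Response for every pixel far enough from the border; zero elsewhere."""
    rows, cols = len(image), len(image[0])
    margin = window + 1
    response = [[0] * cols for _ in range(rows)]
    for r in range(margin, rows - margin):
        for c in range(margin, cols - margin):
            response[r][c] = moravec_response(image, r, c, window)
    return response


# Hypothetical 8 x 8 image: a bright square fills the lower-right quadrant.
IMAGE = [[100 if r >= 4 and c >= 4 else 0 for c in range(8)] for r in range(8)]
RESPONSE = moravec_map(IMAGE)

print(RESPONSE[2][2])   # flat region
print(RESPONSE[5][4])   # on a straight edge
print(RESPONSE[4][4])   # at the corner of the square
```

Flat regions and straight edges both admit at least one shift that leaves the window unchanged, so their response is zero; only the corner changes under every shift, which is the expressive local texture that point detectors look for.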

B. Background Subtraction 

Object detection can be achieved by building a 
representation of the scene called the background model and 
then finding deviations from the model for each incoming 
frame. Any significant change in an image region from the 
background model signifies a moving object. The pixels 
constituting the regions undergoing change are marked for 
further processing. Usually, a connected component algorithm
is applied to obtain connected regions corresponding to the
objects. This process is referred to as background
subtraction.
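
The two steps described above, thresholding the deviation from a background model and grouping the marked pixels into connected regions, can be sketched as follows. The static background image, the fixed threshold, and the 4-connectivity are simplifying assumptions; the methods discussed next use richer background models.

```python
from collections import deque


def foreground_mask(frame, background, threshold):
    """Mark pixels that deviate from the background model by more than threshold."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def connected_components(mask):
    """Group marked pixels into 4-connected regions using breadth-first search."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


# Hypothetical scene: uniform background, two objects, one weak noise pixel.
BACKGROUND = [[50] * 6 for _ in range(4)]
FRAME = [row[:] for row in BACKGROUND]
for r, c in [(0, 0), (0, 1), (1, 0)]:
    FRAME[r][c] = 200
for r, c in [(2, 4), (3, 4), (3, 5)]:
    FRAME[r][c] = 120
FRAME[1][3] = 55   # change below the threshold, should be ignored

REGIONS = connected_components(foreground_mask(FRAME, BACKGROUND, threshold=30))
print(len(REGIONS), [len(region) for region in REGIONS])
```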

For instance, Stauffer and Grimson [21] use a mixture of 
Gaussians to model the pixel color. In this method, a pixel in 
the current frame is checked against the background model 
by comparing it with every Gaussian in the model until a 
matching Gaussian is found. If a match is found, the mean
and variance of the matched Gaussian are updated; otherwise,
a new Gaussian with the mean equal to the current pixel color
and some initial variance is introduced into the mixture. Each 
pixel is classified based on whether the matched distribution 
represents the background process. Moving regions, which 
are detected using this approach, along with the background 
models are shown in Figure 3. 
Fig 3. Mixture of Gaussian modeling for background subtraction. (a)
Image from a sequence in which a person is walking across the scene.
(b) The mean of the highest-weighted Gaussians at each pixel position.
These means represent the most temporally persistent per-pixel color
and hence should represent the stationary background. (c) The means
of the Gaussian with the second-highest weight; these means represent
colors that are observed less frequently. (d) Background
subtraction result. The foreground consists of the pixels in the current
frame that matched a low-weighted Gaussian.
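
The match, update, and classify cycle of a per-pixel Gaussian mixture can be sketched for a single grayscale pixel as shown below. This is a simplified illustration and not the implementation of the cited method: the parameter values, the use of one learning rate for weights, means, and variances, the variance floor, and the class name `PixelMixture` are all assumptions of this sketch.

```python
import math


class PixelMixture:
    """Simplified mixture-of-Gaussians background model for one grayscale pixel."""

    def __init__(self, max_components=3, learning_rate=0.05,
                 initial_variance=225.0, min_variance=4.0,
                 match_sigmas=2.5, background_ratio=0.7):
        self.max_components = max_components
        self.learning_rate = learning_rate
        self.initial_variance = initial_variance
        self.min_variance = min_variance
        self.match_sigmas = match_sigmas
        self.background_ratio = background_ratio
        self.components = []   # each entry is [weight, mean, variance]

    def update(self, value):
        """Update the model with one observation; return True if it is background."""
        alpha = self.learning_rate
        matched = None
        for comp in self.components:
            if abs(value - comp[1]) <= self.match_sigmas * math.sqrt(comp[2]):
                matched = comp
                break
        if matched is not None:
            # Pull the matched Gaussian toward the new observation.
            delta = value - matched[1]
            matched[1] += alpha * delta
            matched[2] = max(self.min_variance,
                             (1 - alpha) * matched[2] + alpha * delta * delta)
        else:
            # No match: introduce a new Gaussian, replacing the weakest if full.
            if len(self.components) >= self.max_components:
                weakest = min(range(len(self.components)),
                              key=lambda k: self.components[k][0])
                del self.components[weakest]
            matched = [0.0, float(value), self.initial_variance]
            self.components.append(matched)
        for comp in self.components:
            comp[0] = (1 - alpha) * comp[0] + (alpha if comp is matched else 0.0)
        total = sum(comp[0] for comp in self.components)
        for comp in self.components:
            comp[0] /= total
        return self._is_background(matched)

    def _is_background(self, target):
        """Background = highest-ranked Gaussians that together exceed the ratio."""
        ranked = sorted(self.components,
                        key=lambda comp: comp[0] / math.sqrt(comp[2]),
                        reverse=True)
        cumulative = 0.0
        for comp in ranked:
            if comp is target:
                return True
            cumulative += comp[0]
            if cumulative > self.background_ratio:
                return False
        return False


model = PixelMixture()
history = [model.update(100) for _ in range(50)]   # stable background intensity
print(all(history))
print(model.update(200))   # a passing object: new low-weight Gaussian
print(model.update(100))   # background visible again
```

In this sketch the very first observation is treated as background because it is the only Gaussian in the mixture; a passing object creates a low-weight Gaussian and is therefore classified as foreground.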

Another approach is to incorporate region-based (spatial) 
scene information instead of only using color-based 
information. Elgammal and Davis [22] use nonparametric 
kernel density estimation to model the per-pixel background. 
During the subtraction process, the current pixel is matched 
not only to the corresponding pixel in the background model, 
but also to the nearby pixel locations. Thus, this method can 
handle camera jitter or small movements in the background. 
Li and Leung [2002] fuse the texture and color features to 
perform background subtraction over blocks of 5 x 5 pixels. 
Since texture does not vary greatly with illumination changes, 
the method is less sensitive to illumination. Toyama et al.
[1999] propose a three-tiered algorithm to deal with the
background subtraction problem. In addition to pixel-level
subtraction, the authors use region-level and frame-level
information. At the pixel level, the authors propose to
use Wiener filtering to make probabilistic predictions of the
expected background color. At the region level, foreground
regions consisting of homogeneous color are filled in. At the
frame level, if most of the pixels in a frame change
suddenly, it is assumed that the pixel-based color background
models are no longer valid. At this point, either a previously
stored pixel-based background model is swapped in, or the
model is reinitialized.

In the eigenspace-based approach, the foreground objects are detected
by projecting the current image onto the eigenspace and finding
the difference between the reconstructed and actual images.
Detected object regions obtained with the eigenspace
approach are shown in Figure 4.
Figure 4. Eigenspace decomposition-based background subtraction
(space is constructed with objects in the FOV of camera): (a) an
input image with objects, (b) reconstructed image after projecting
input image onto the eigenspace, (c) difference image. Note that
the foreground objects are clearly identifiable.




ACEEE Int. J. on Signal & Image Processing, Vol. 03, No. 01, Jan 2012 



C. Segmentation 

The aim of image segmentation algorithms is to partition 
the image into perceptually similar regions. Every
segmentation algorithm addresses two problems: the criteria
for a good partition and the method for achieving efficient
partitioning [23].

1. Mean-Shift Clustering. 

For the image segmentation problem, Comaniciu and Meer
[2002] propose the mean-shift approach to find clusters in
the joint spatial-color space, [l, u, v, x, y], where [l, u, v]
represents the color and [x, y] represents the spatial location.
Given an image, the algorithm is initialized with a large number
of hypothesized cluster centers randomly chosen from the
data. Then, each cluster center is moved to the mean of the
data lying inside the multidimensional ellipsoid centered on
the cluster center. The vector defined by the old and the new
cluster centers is called the mean-shift vector. The mean-shift
vector is computed iteratively until the cluster centers
do not change their positions. Note that during the mean-shift
iterations, some clusters may get merged. In Figure 5(b),
we show the segmentation using the mean-shift approach
generated using the source code available at Mean Shift
Segments.
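
The mean-shift iteration described above can be sketched on a handful of points. For brevity the sketch uses 2D points in place of the five-dimensional [l, u, v, x, y] vectors, a spherical window of fixed radius in place of the ellipsoid, and every data point as a starting center rather than a random subset; these simplifications, the merge rule, and all names are assumptions of this example.

```python
import math


def mean_shift(points, bandwidth, tol=1e-6, max_iter=100):
    """Shift a center started at every point to the local mean until it stops."""
    modes = []
    for start in points:
        center = start
        for _ in range(max_iter):
            # Data lying inside the window centered on the current center.
            neighbors = [p for p in points if math.dist(p, center) <= bandwidth]
            new_center = tuple(sum(axis) / len(neighbors) for axis in zip(*neighbors))
            shift = math.dist(new_center, center)   # length of the mean-shift vector
            center = new_center
            if shift < tol:
                break
        modes.append(center)
    return modes


def merge_modes(modes, bandwidth):
    """Merge converged centers that lie closer than half the bandwidth."""
    merged = []
    for mode in modes:
        if not any(math.dist(mode, kept) < bandwidth / 2 for kept in merged):
            merged.append(mode)
    return merged


# Hypothetical data: two well-separated groups of three points each.
POINTS = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]

CENTERS = merge_modes(mean_shift(POINTS, bandwidth=1.5), bandwidth=1.5)
print(CENTERS)
```

Each of the six starting centers converges to the mean of its own group, and the merge step collapses them into two cluster centers.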

2. Image Segmentation Using Graph-Cuts. 

Image segmentation can also be formulated as a graph
partitioning problem, where the vertices (pixels), V = {u, v, ...},
of a graph (image), G, are partitioned into N disjoint
subgraphs (regions), A_i, such that

$\bigcup_{i=1}^{N} A_i = V, \quad A_i \cap A_j = \emptyset, \quad i \neq j,$

by pruning the weighted edges of the graph. The total weight
of the pruned edges between two subgraphs is called a cut.
The weight is typically computed by color, brightness, or
texture similarity between the nodes. Wu and Leahy [1993]
use the minimum cut criterion, where the goal is to find the
partitions that minimize a cut. In their approach, the weights
are defined based on the color similarity. One limitation of
minimum cut is its bias toward oversegmenting the image.
This effect is due to the increase in the cost of a cut with the
number of edges going across the two partitioned segments.
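
The bias of the minimum cut criterion can be seen on a very small graph. The sketch below computes the cut weight of a partition and finds the minimum two-way cut by exhaustive search, which is only feasible for tiny graphs. The six-node graph, its weights, and the brute-force search are assumptions for illustration and do not reproduce the algorithm of Wu and Leahy.

```python
from itertools import combinations


def cut_weight(weights, part_a):
    """Total weight of the edges with exactly one endpoint in part_a."""
    return sum(w for (u, v), w in weights.items() if (u in part_a) != (v in part_a))


def minimum_cut(nodes, weights):
    """Exhaustively search all two-way partitions for the smallest cut."""
    best = None
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            cost = cut_weight(weights, set(subset))
            if best is None or cost < best[0]:
                best = (cost, set(subset))
    return best


# Hypothetical fully connected graph with two natural groups, {0, 1, 2} and
# {3, 4, 5}: similar pixels (same group) get weight 4, dissimilar ones weight 2.
NODES = list(range(6))
GROUP = {node: node // 3 for node in NODES}
WEIGHTS = {(u, v): (4 if GROUP[u] == GROUP[v] else 2)
           for u, v in combinations(NODES, 2)}

print(cut_weight(WEIGHTS, {0, 1, 2}))   # cut between the two natural groups
print(minimum_cut(NODES, WEIGHTS))      # cheapest cut found by the search
```

Separating the two natural groups prunes nine edges, whereas isolating a single node prunes only five, so the cheapest cut splits off one node even though its pruned edges are stronger on average.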




Fig 5. Segmentation of the image shown in (a), using mean-shift 
segmentation (b) and normalized cuts (c) 

IV. OBJECT TRACKING 

The aim of an object tracker is to generate the trajectory of an
object over time by locating its position in every frame of the
video. An object tracker may also provide the complete region
in the image that is occupied by the object at every time
instant. The tasks of detecting the object and establishing
correspondence between the object instances across frames
can be performed either separately or jointly. In the first case,
possible object regions in every frame are obtained by means
of an object detection algorithm, and then the tracker
establishes correspondence between objects across frames. In the
latter case, the object region and correspondence are jointly
estimated by iteratively updating the object location and region
information obtained from previous frames. In either tracking approach,
the objects are represented using the shape and/or
appearance models described in Section II. The model selected
to represent object shape limits the type of motion or
deformation it can undergo. For example, if an object is
represented as a point, then only a translational model can
be used. In the case where a geometric shape representation
like an ellipse is used for the object, parametric motion models
like affine or projective transformations are appropriate.

©2012 ACEEE
DOI:01.JJSJP.03.01.84
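
The three parametric motion models named in this section can be written out for a set of 2D points as in the sketch below. The parameter names and the rectangle used as input are assumptions for illustration. Translation has two parameters, the affine model six, and the projective model is given by a 3 x 3 matrix applied in homogeneous coordinates.

```python
def translate(points, tx, ty):
    """Translation: x' = x + tx, y' = y + ty."""
    return [(x + tx, y + ty) for x, y in points]


def affine(points, a, b, c, d, tx, ty):
    """Affine model: x' = a*x + b*y + tx, y' = c*x + d*y + ty."""
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


def projective(points, h):
    """Projective model: apply a 3 x 3 homography h, then divide by w."""
    result = []
    for x, y in points:
        w = h[2][0] * x + h[2][1] * y + h[2][2]
        result.append(((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
                       (h[1][0] * x + h[1][1] * y + h[1][2]) / w))
    return result


# Hypothetical rectangular patch given by its four corners.
CORNERS = [(0, 0), (2, 0), (2, 1), (0, 1)]
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(translate(CORNERS, 3, 4))
print(affine(CORNERS, 2, 0, 0, 2, 1, 1))    # uniform scaling by 2 plus a shift
print(affine(CORNERS, 0, -1, 1, 0, 0, 0))   # rotation by 90 degrees
print(projective(CORNERS, IDENTITY))        # identity homography
```

A single point carries no orientation or extent, so only `translate` is meaningful for it, whereas a patch with several corners can also be scaled, rotated, sheared, or perspectively distorted.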
Fig 6. Taxonomy of tracking methods.

In view of the aforementioned discussion, we provide
a taxonomy of tracking methods in Figure 6. Representative
work for each category is tabulated in Table II. We now briefly 
introduce the main tracking categories, followed by a detailed 
section on each category. 

Table II. Tracking Categories

Categories                        Representative Work
Point Tracking
  Deterministic methods           MGE tracker [35],
                                  GOA tracker [34].
  Statistical methods             Kalman filter [36],
                                  PMHT [36].
Kernel Tracking
  Template and density based      Mean-shift [4],
  appearance models               KLT [20],
                                  Layering [37].
  Multi-view appearance models    Eigentracking [Black and Jepson 1998],
                                  SVM tracker [33].
Silhouette Tracking
  Contour evolution               State space models [39],
                                  Variational methods [40],
                                  Heuristic methods [41].
  Matching shapes                 Hausdorff [42],
                                  Histogram [43].

— Point Tracking. Objects detected in consecutive frames
are represented by points, and the association of the points
is based on the previous object state, which can include
object position and motion. This approach requires an external
mechanism to detect the objects in every frame. An example
of object correspondence is shown in Figure 7(a).




Fig 7. Different tracking approaches. (a) Multipoint
correspondence, (b) parametric transformation of a rectangular
patch, (c, d) two examples of contour evolution.

— Kernel Tracking. Kernel refers to the object shape and
appearance. For example, the kernel can be a rectangular
template or an elliptical shape with an associated histogram.
Objects are tracked by computing the motion of the kernel in
consecutive frames (Figure 7(b)). This motion is usually in
the form of a parametric transformation such as a translation,
a rotation, or an affine transformation.

— Silhouette Tracking. Tracking is performed by estimating 
the object region in each frame. Silhouette tracking methods 
use the information encoded inside the object region. This 
information can be in the form of appearance density and 
shape models which are usually in the form of edge maps. 
Given the object models, silhouettes are tracked by either 
shape matching or contour evolution (see Figure 7(c), (d)). 
Both of these methods can essentially be considered as object 
segmentation applied in the temporal domain using the priors 
generated from the previous frames. 
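
For the point tracking category, the association of detections with existing tracks based on previous position and motion can be sketched with a greedy nearest-neighbor rule. The constant-velocity prediction, the distance gate `max_distance`, the greedy matching order, and the dictionary layout of a track are assumptions of this sketch; the deterministic and statistical methods in Table II use more elaborate criteria.

```python
import math


def predict(track):
    """Constant-velocity prediction of the next position of a track."""
    x, y = track["position"]
    vx, vy = track["velocity"]
    return (x + vx, y + vy)


def associate(tracks, detections, max_distance):
    """Greedily pair each predicted track position with the nearest free detection."""
    pairs = sorted(
        (math.dist(predict(track), detection), ti, di)
        for ti, track in enumerate(tracks)
        for di, detection in enumerate(detections)
    )
    matches = {}
    used_detections = set()
    for distance, ti, di in pairs:
        if distance > max_distance:
            break                      # remaining pairs are even farther apart
        if ti in matches or di in used_detections:
            continue
        matches[ti] = di
        used_detections.add(di)
    return matches


# Hypothetical state: two objects moving toward each other along the x axis.
TRACKS = [
    {"position": (0.0, 0.0), "velocity": (1.0, 0.0)},
    {"position": (10.0, 0.0), "velocity": (-1.0, 0.0)},
]
DETECTIONS = [(9.1, 0.1), (1.0, 0.1)]   # detector output for the next frame

print(associate(TRACKS, DETECTIONS, max_distance=2.0))
```

The returned dictionary maps track indices to detection indices. Tracks without a detection inside the gate stay unmatched, which is where an external detector or an occlusion model would have to step in.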

Conclusion 

A survey of object detection and tracking
methods is presented in this paper. Recognizing the
importance of object shape representations for detection and
tracking systems, we have included a discussion of the popular
representations. A summary of the criteria for
feature selection and of object tracking methods is also presented,
which can give valuable insight into this important research topic.